Towards Solving the Inverse Protein Folding Problem

Authors

Yoojin Hong

Kyung Dae Ko

Gaurav Bhardwaj

Zhenhai Zhang

Damian B. van Rossum

Randen L. Patterson

Abstract

Accurately assigning folds for divergent protein sequences is a major obstacle to structural studies and underlies the inverse protein folding problem. Herein, we outline our theories for fold recognition in the “twilight zone” of sequence similarity (<25% identity). Our analyses demonstrate that structural sequence profiles built using Position-Specific Scoring Matrices (PSSMs) significantly outperform multiple popular homology-modeling algorithms for relating and predicting structures given only their amino acid sequences. Importantly, structural sequence profiles reconstitute SCOP fold classifications in control and test datasets. Results from our experiments suggest that structural sequence profiles can be used to rapidly annotate protein folds at proteomic scales. We propose that encoding the entire Protein DataBank (~1070 folds) into structural sequence profiles would extract interoperable information capable of improving most if not all methods of structural modeling.

INTRODUCTION

It has been proposed that the number of distinct native-state protein folds is extremely limited (1). In addition, structure is more conserved than sequence (1-3). Taken together, these attributes underscore the inverse protein folding problem, whereby the vast and varied primary amino acid sequences that exist in biology occupy a relatively limited number of structural folds. Because of the extreme divergence (≤25% pairwise identity) that can exist between structurally resolved (template) sequences and structurally unknown (target) sequences, fold classification is often compromised. Thus, the crucial information specifying protein structure must be contained in a very small fraction of the amino acid sequence, making the informative points hard to measure. Therefore, a solution to the inverse protein folding problem must be able to identify these information points and use them to relate targets to appropriate template sequences.
From a practical standpoint, the inverse protein folding problem manifests at various stages of structural modeling, depending on the method employed. For example, in homology-based threading techniques (e.g., Swiss-Model and Modeller (4, 5)), the model will be inaccurate unless the target is initially aligned to a structure of the correct fold. Approaches that use physical modeling techniques (and hybrids thereof) construct the fold from smaller structural units, which can often be assembled in multiple low-energy conformations. Mistakes can therefore occur during assembly of the fold; further, these approaches are often computationally expensive. The inverse protein folding problem has also been attacked using profiling methods such as FFAS03, SAM-T2K, and prof_sim (6-8). However, all of these algorithms experience a “glass ceiling” whereby, at relevant statistical limits, fewer than 30% of benchmark structural datasets can be properly classified between sequences of ≤25% identity. Thus, if the correct fold for a given target sequence could be rapidly and reliably defined, most if not all structural modeling approaches could be improved.

We recently theorized that a computational platform could be developed for sensitive homology detection and secondary structure annotation using rps-BLAST compatible PSSM libraries (9-11). In the present manuscript, we report that structural sequence profiles (i.e., fold-specific PSSM libraries) are a robust method of fold classification that works in the “twilight zone” of sequence similarity using simple algebra. Our findings demonstrate that structural sequence profiles are a new performance benchmark for the detection of distant structural homology. These results also provide support for our theories that sufficiently large PSSM libraries provide a solution to the inverse protein folding problem.
RESULTS

Fold-based Structural Sequence Profiles

The power behind our structural sequence profiles is derived from libraries of Position-Specific Scoring Matrices (PSSMs, i.e., profiles) of functionally or structurally similar proteins. A PSSM contains a frequency table for the substitutions that occur in related sequences, which makes PSSMs a powerful measure of homology. Indeed, it is well established that PSSMs contain more information than individual sequences (12-14). We take advantage of the increased information content of PSSMs and quantify their alignments within a structural sequence profile. Three features make our method distinct from traditional sequence analysis methods. First, we measure targets with multiple structure-specific PSSM libraries. Second, we quantify low-identity alignments, which are traditionally considered statistically insignificant. Third, we consider all relationships (to the same fold and to different folds) to extract meaningful signals, which appear to be important for measurements in the “twilight zone” (9-11, 15).

Alignment Comparisons and Information Content

To test our method, we used the TZ-SABmark, a carefully curated set of fold-specific sequences of remote homology (16). Each fold-specific sequence group represents a SCOP (17) fold classification of related sequences with ≤25% sequence identity. From the original TZ-SABmark, 534 sequences from 61 fold groups (average length 135.27 ± 89.39 s.d.), selected at random, were used as a test set. Our method involves three steps to infer remote structural homology between proteins (Fig. 1). First, we collect the sequences for each of the 61 test fold groups from the Protein Data Bank (PDB (18)). PDB sequences in the TZ-SABmark were excluded to avoid debate.
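As background for the PSSM libraries discussed here, the following is a minimal sketch of what a position-specific scoring matrix encodes. It is a toy log-odds construction with a uniform background and a flat pseudocount, both of which are assumptions made for illustration; it is not the procedure PSI-BLAST uses, and the names `build_pssm` and `score_sequence` are hypothetical.

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Build a toy log-odds PSSM from equal-length, pre-aligned sequences.

    ASSUMPTIONS (illustrative only): uniform background frequencies and a
    flat pseudocount per residue. Gap or unknown characters are ignored.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        # Count the residues observed in this alignment column.
        counts = Counter(s[col] for s in aligned_seqs if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        # Log-odds of the smoothed column frequency against the background.
        column = {
            aa: log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(column)
    return pssm


def score_sequence(pssm, seq):
    """Sum the per-position log-odds of an ungapped sequence against the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


if __name__ == "__main__":
    toy_pssm = build_pssm(["ACD", "ACE", "AGD"])
    print(score_sequence(toy_pssm, "ACD"), score_sequence(toy_pssm, "WWW"))
```

Residues that recur in a column receive positive scores and unobserved residues receive negative ones, which is the sense in which a profile carries more information than any single sequence.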
Except for one fold group (SCOP fold b.1; immunoglobulin-like beta-sandwich fold), which already has >1000 PDB sequences, the PDB sequences of the other 60 fold groups were expanded by PSI-BLAST (13) searches against the NCBI NR database, using the PDB sequences themselves as queries. Sequences nearly identical to the PDB sequences (≥90% identity) were removed. For each fold group, redundant or highly similar sequences (≥40% identity) were also eliminated. Fold-specific libraries for the 61 fold groups were then built by generating PSSMs from the sequences obtained from PSI-BLAST. The fold-specific PSSMs were then compiled as an rps-BLAST (19) compatible database (Fig. 1b). Second, each query sequence is searched against the 61 fold-specific PSSM libraries using rps-BLAST. Alignments returned from the search are filtered out if they do not satisfy our e-value and coverage thresholds (coverage being alignment length as a function of library PSSM length). Third, given the alignments to a fold-specific library, a fold-specific score is calculated and encoded in a structural sequence profile, in which each query is a vector of fold-specific scores (Fig. 1c; see Methods for more details). As a quantitative measure of how similar two targets are (i.e., the structural similarity score), we calculated Pearson’s correlation coefficient between their vectors. In these studies, alignments encoded in structural sequence profiles were collected using one of two threshold settings: an e-value of 0.01 with no coverage requirement, or an e-value of 10 with 80% coverage.

[Figure 1. Construction of structural sequence profiles. Sequences with known fold (Fold 1, Fold 2, Fold 3, ..., Fold n) are expanded by PSI-BLAST against the NR database and used for PSSM generation. Alignments between a query and the Fold k PSSMs are processed in three stages: (i) positional scoring over the amino acid positions of the query protein sequence, with +2 for identical matches and +1 for positive matches; (ii) positional score normalization, norm. ps(i) = ps(i) - avg.(positional scores of all positions), where ps(i) is the positional score of the ith amino acid; (iii) calculation of the Fold k-specific score of the query protein, S_k, the fold-specific score of protein i for fold k.]
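The scoring scheme in the Figure 1 legend and the Pearson comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: alignments are represented as hypothetical `(start, midline)` pairs using BLAST-style midline symbols (`|` for identical, `+` for positive), scores from multiple alignments are assumed to accumulate per position, and because the text defers the final aggregation to Methods, `fold_score` uses a placeholder rule (the sum of positive normalized scores) that is purely an assumption.

```python
from math import sqrt


def positional_scores(query_len, alignments):
    """Accumulate +2 (identical) and +1 (positive) per query position.

    `alignments` is a list of (start, midline) pairs; this representation
    and the accumulation across alignments are ASSUMPTIONS for illustration.
    """
    ps = [0.0] * query_len
    for start, midline in alignments:
        for offset, symbol in enumerate(midline):
            if symbol == "|":
                ps[start + offset] += 2
            elif symbol == "+":
                ps[start + offset] += 1
    return ps


def normalize(ps):
    """norm. ps(i) = ps(i) - average of the positional scores of all positions."""
    mean = sum(ps) / len(ps)
    return [s - mean for s in ps]


def fold_score(norm_ps):
    """ASSUMPTION: placeholder aggregation (sum of positive normalized scores).

    The source defers the actual fold-specific score to its Methods section.
    """
    return sum(s for s in norm_ps if s > 0)


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Two hypothetical queries, each scored against three fold libraries.
    profile_a = [fold_score(normalize(positional_scores(5, alns)))
                 for alns in ([(0, "|+ ")], [(2, "||")], [(1, "+")])]
    profile_b = [fold_score(normalize(positional_scores(5, alns)))
                 for alns in ([(0, "||+")], [(3, "|")], [(1, "+")])]
    print(pearson(profile_a, profile_b))
```

Each query thus becomes a vector with one entry per fold library, and two queries are compared only through the correlation of those vectors, so no direct alignment between the two queries is needed.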


Similar resources

Solving the inverse problem of determining an unknown control parameter in a semilinear parabolic equation

The inverse problem of identifying an unknown source control parameter in a semilinear parabolic equation under an integral overdetermination condition is considered. The series pattern solution of the proposed problem is obtained by using the weighted homotopy analysis method (WHAM). A description of the method for solving the problem and finding the unknown parameter is derived. Finally, tw...


Solving an Inverse Heat Conduction Problem by Spline Method

In this paper, a numerical solution of an inverse non-dimensional heat conduction problem by the spline method is considered. The given heat conduction equation, the boundary condition, and the initial condition are presented in dimensionless form. A set of temperature measurements at a single sensor location inside the heat conduction body is required. The results show that the proposed meth...


On the inverse maximum perfect matching problem under the bottleneck-type Hamming distance

Given an undirected network G(V,A,c) and a perfect matching M of G, the inverse maximum perfect matching problem consists of modifying minimally the elements of c so that M becomes a maximum perfect matching with respect to the modified vector. In this article, we consider the inverse problem when the modifications are measured by the weighted bottleneck-type Hamming distance. We propose an alg...


A modified VIM for solving an inverse heat conduction problem

In this paper, we use a modified variational iteration method (MVIM) for solving an inverse heat conduction problem (IHCP). The approximation of the temperature and the heat flux at are considered. This method is based on the use of Lagrange multipliers for the identification of optimal values of parameters in a functional in Euclidean space. Applying this technique, a rapid convergent s...


A regularization method for solving a nonlinear backward inverse heat conduction problem using discrete mollification method

The present essay scrutinizes the application of discrete mollification as a filtering procedure to solve a nonlinear backward inverse heat conduction problem in one-dimensional space. These problems are seriously ill-posed. So, we combine discrete mollification and the space marching method to address the ill-posedness of the proposed problem. Moreover, a proof of stability and...


Solving Inverse Sturm-Liouville Problems with Transmission Conditions on Two Disjoint Intervals

In the present paper, some spectral properties of boundary value problems of Sturm-Liouville type on two disjoint bounded intervals with transmission boundary conditions are investigated. Uniqueness theorems for the solution of the inverse problem are proved, and then the reconstruction of the coefficients of the Sturm-Liouville problem by the spectral mappings method is studied.



Journal title:

CoRR

Volume abs/1008.4938

Pages -

Publication date: 2010
